Abstract — Robust visual tracking for long video sequences is a research area that has many important applications. The main 
challenges include how the target image can be modeled and how this model can be updated. In this paper, we model the target 
using a covariance descriptor, as this descriptor is robust to problems such as pixel-to-pixel misalignment and pose and illumination 
changes, which commonly occur in visual tracking. We model the changes in the template using a generative process. We 
introduce a new dynamical model for the template update using a random walk on the Riemannian manifold on which the covariance 
descriptors lie. This is done in the log-transformed space of the manifold to free the constraints imposed inherently by positive 
semidefinite matrices. Modeling template variations and pose kinematics together in the state space enables us to jointly quantify 
the uncertainties relating to the kinematic states and the template in a principled way. Finally, the sequential inference of the 
posterior distribution of the kinematic states and the template is done using a particle filter. Our results show that this principled 
approach can be robust to changes in illumination, pose and spatial affine transformation. In the experiments, our method 
outperformed the current state-of-the-art algorithm, the incremental Principal Component Analysis method [34], particularly 
when a target underwent fast pose changes, and also maintained a comparable performance in stable target tracking cases. 

Index Terms — Tracking, particle filtering, template update, generative template model, Riemannian manifolds, log-transformed 
space. 
1 Introduction 

Visual tracking is an important vision research topic 
that has many applications, including motion-based 
recognition [7], surveillance flSl, and human-computer 
interaction [10]. It also covers many 
aspects of computer vision problems, such as target 
feature representation [46], feature selection [9], and 
feature learning [15]. Even though it has been actively 
researched for decades, many challenges remain, especially 
with changes in target pose, appearance, 
and illumination in a long video sequence. Figure 1 
shows two simple examples of how a target can vary 
over a short time interval. These challenges are 
common and must be solved to achieve 
long, stable tracking in many real life tasks. There 
are generally three common approaches to deal with 
target appearance variations. The first is to use robust or 
invariant target features such as the scale invariant feature 
transform and the color histogram [3]. However, as 
shown by Figure 1, target appearance can change 
significantly over time and end up totally different 
from the starting frame due to variations in target 
pose and image illumination. The second approach 
is to employ a complete set of possible target models 
fll, aiming to model possible target variations. 



Marcus Chen and Cham Tat Jen are with the School 
of Computer Engineering, Nanyang Technological University. 

Pang Sze Kim and Alvina Goh are with DSO National Laboratories, 
Singapore. 





Fig. 1. Target patches sampled over 871 successive frames 
(#1, 31, ..., 871) from 2 video sequences. The target changes 
in illumination, pose, and appearance even after 
being affine warped to a standard size. 



However, this requires learning the target model 
in advance and hardly scales. Finally, the 
last approach is to update the template gradually 
as it evolves. Note that in this paper, we loosely 
use the term template for the target representation, and 
do not strictly limit it to image patches. There are 
several choices for a target template found in the 
literature. For example, [38] uses the histogram of 
oriented gradients, while I^SJ uses the color histogram, 
[45] the L1 sparse representation, p3) the active appearance 
model, [34] the principal subspace of image patches, and 
[31] the feature covariance. 

The template update problem can be expressed 
mathematically as Eqn. (1): 

    $T_t = f(\hat{T}_t, T_{t-1})$,    (1) 



where $\hat{T}_t, T_t$, $t \in [1, 2, \ldots]$ are the estimated and 
updated templates respectively at time $t$. However, as 
shown in [23], target template updating is a challenging 
task. According to [23], if the template were not 
updated at all, it would soon become outdated 
and could not be used for matching, as the target 
appearance would have changed over time. 
On the other hand, updating at every frame would 
result in an accumulation of small errors, and eventually 
in template drift and loss of target information. 

Recognizing the importance of template update, 
many methods have been proposed. One common 
and intuitive approach is to use a linear updating 
function in the respective feature space, such as [31] 
on the covariance manifold. This smooths the 
changes between the estimated template and the updated 
template. Similarly, the Kalman filter has also been used 
in p5| to track template feature variables, but not 
the target trajectory. On the other hand, there are three 
well-known template update algorithms in the literature, 
namely template alignment [23], online Expectation 
Maximization (EM) [19], and the incremental 
subspace method [34]. Here, we briefly survey these 
three algorithms. 

In the template alignment method, [23] proposes a 
heuristic but robust criterion to decide whether to 
update the template at time t. The basic idea is to 
keep the starting template to correct the drift of the 
estimated template. The latest estimated template is 
first matched to the previous updated template. It is 
then warped before checking against the first template. 
For a small template displacement, this method works 
very well. However, by imposing alignment between 
the latest template and the first template, this method 
inherently limits target pose changes to a warping 
model. 

The online EM method [19] employs a mixture of 
three template distributions to account for template 
variations, namely a long term stable template, an interframe 
variational template, and an outlier template. These templates 
model the stable appearance of the target, interframe 
changes in appearance or pose, and occlusion or 
outliers respectively. Employing a Gaussian mixture 
model, parameters and membership are estimated on 
the fly using online EM. In this framework, each pixel 
in the target patches is assumed to be independent, 
and consequently more stable pixels tend to gain 
more weight in the similarity measure. This could 
gradually drift the template in the presence of more 
stable background pixels. 

The third algorithm represents the target in 
its eigenspace, as proposed by [34]. The posterior 
estimates of the template are collected over an interval, 
and these estimates are then analyzed online 
through an Incremental Principal Component Analysis 
(IPCA) method. This method can capture changes 
in template variation in the eigenbases. The mean of the 
posterior estimates is also kept as a stable template. 
The authors have tested IPCA with various video 
sequences, and demonstrated its great robustness to 
template variations due to pose changes and illumination 
changes. Figure 2 illustrates an incremental 
update of eigenbases and means. The images in the 
3rd row show how the eigenbases evolve over time. 
It has been shown in the paper that the updated 
templates could almost reconstruct the original image 
samples over the sequence, reflecting the ability 
of the eigenbases to model temporal variations. Although 
IPCA is often very robust and can track a target 
very accurately even in noisy, low contrast image 
sequences, IPCA falls short when the target undergoes 
fast pose changes and dramatic illumination changes, 
as stated in the paper. This may be because PCA 
inherently assumes that the target templates over time 
are drawn from a Gaussian distribution. Under abrupt changes 
in pose and illumination, this assumption does not 
hold. The unimodal distribution also requires good 
pixel-wise alignment between the posterior estimate 
and the eigenbases; otherwise uncertainties in template 
alignment would contribute to the template variance and 
may lead to non-informative bases. A good example 
from the paper is shown in Figure 2. One can see 
that from frames #600 to #636, the eigenbases are no longer 
representative and the tracker loses track of 
the target. 
(a) Representative eigenbases  (b) Eigenbases of misaligned target regions  (c) Eigenbases are not representative 

Fig. 2. Results of the incremental subspace method on 
the Sylvester sequence. Pixel-wise misalignment could 
render the eigenbases non-representative. The 1st row shows 
the sample frames. The 2nd row images are the current 
sample mean, tracked region, reconstructed image, 
and reconstruction error respectively. The 3rd and 
4th rows are the top 10 principal eigenbases. 
So far, most of the current state-of-the-art algorithms 
update templates in an out-of-chain manner, 
by assuming the posterior estimate is "good enough" 
for template update with pixel-wise alignment. If the 
posterior estimate of the target pose is inaccurate or there 
is a misalignment between the estimated and last 
updated template, the update methods will gradually 
drift. On the other hand, if the template update is not 
good, then the posterior estimate of the target pose is 
unlikely to be accurate. These coupled dual problems 
often render these methods unable to track well when 
the targets undergo fast changes in pose or non-rigid 
transformation. However, robustness to fast target 
pose changes matters in many real life applications such as human 
tracking, maritime target tracking, etc. 

To solve these dual problems faced by the existing 
state-of-the-art algorithms, [8] introduces a novel 
approach to simultaneously quantify these two uncertainties 
by including both of them in the state space 
of a Bayesian framework, instead of just the target pose 
as in the existing methods. In this manner, no posterior 
estimate is used for updating; instead, the better matched 
of multiple hypothesized templates are propagated 
automatically. 

Paper contributions. To the best of our knowledge, 
almost all the state-of-the-art algorithms use out-of-chain 
template updating methods. That is to say, the 
template model is updated after obtaining the 
posterior estimate of the target's position. In this paper, 
we propose a method to update the target model in 
tandem with the target kinematics. In other words, 
we model the target template as a part of the state 
space. We choose the covariance descriptor as the 
target descriptor as it is more robust to problems 
such as pixel-to-pixel misalignment and changes in pose 
and illumination. Since positive definite covariance 
matrices form a Riemannian manifold, we model the 
target template variation by a random walk 
on the covariance Riemannian manifold. We propose 
a novel template propagation mechanism in 
the log-transformed space of the manifold to free the 
constraints imposed inherently by positive semidefinite 
matrices, leading to a greater ability to deal 
with template variations. Our resultant method outperforms 
the state-of-the-art Incremental PCA algorithm 
[34] in dealing with fast moving and changing 
targets, as will be shown in the experiments 
section. 

The paper is organized as follows: Section 2 gives 
a brief introduction to both covariance descriptors 
and the Riemannian manifold. Section 3 gives a Bayesian 
formulation of the simultaneous inference of both target 
kinetics and the template posterior distribution. Section 4 
analyzes the template generative process. In Section 
5, we empirically compare our results with IPCA and 
give a short discussion. Finally, Section 6 concludes 
this paper. 



2 Target Covariance Descriptor 

In this section, we explain the motivation for using the 
covariance descriptor and its operations on the Riemannian 
manifold. 



2.1 Covariance Descriptor 

A covariance descriptor is defined as follows: 

    $C = \frac{1}{N-1} \sum_{i=1}^{N} \left( f(i) - \bar{f} \right) \left( f(i) - \bar{f} \right)^T$,    (2) 

where $f$ is a feature vector and $\bar{f} = \frac{1}{N} \sum_{i=1}^{N} f(i)$ is 
the mean of the feature vector over the $N$ pixels in the 
target region. In this paper, we use the following 9-dimensional 
feature vector: 



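    $f(i) = \left[ x_w,\ y_w,\ I(x_w, y_w),\ I_{x_w},\ I_{y_w},\ \sqrt{I_{x_w}^2 + I_{y_w}^2},\ \arctan\frac{I_{y_w}}{I_{x_w}},\ I_{xx_w},\ I_{yy_w} \right]^T$.    (3) 

They are the x, y coordinates, pixel intensity, x, y directional 
intensity gradients, gradient magnitude and 
angle, and second order gradients respectively. The subscript $w$ 
denotes that these features are extracted after warping 
image patches to a standard size. 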
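
As a rough illustration of Eqns. (2) and (3), the following NumPy sketch computes a 9 x 9 covariance descriptor from a grayscale patch that is assumed to be already warped to the standard size. The function name `covariance_descriptor`, the use of `np.gradient` for the derivatives, and the use of `arctan2` for the gradient angle are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def covariance_descriptor(patch):
    """Return the 9 x 9 covariance of per-pixel features for a grayscale patch.

    `patch` is assumed to be already warped to the standard size.
    """
    patch = np.asarray(patch, dtype=float)
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)   # pixel coordinates (y_w, x_w)
    Iy, Ix = np.gradient(patch)                 # first-order gradients (rows = y, cols = x)
    Ixx = np.gradient(Ix, axis=1)               # second-order gradients
    Iyy = np.gradient(Iy, axis=0)
    magnitude = np.hypot(Ix, Iy)
    angle = np.arctan2(Iy, Ix)                  # assumption: arctan2 avoids division by zero
    features = np.stack(
        [xs, ys, patch, Ix, Iy, magnitude, angle, Ixx, Iyy]
    ).reshape(9, -1)                            # one row per feature, one column per pixel
    return np.cov(features)                     # unbiased estimate, divides by N - 1


rng = np.random.default_rng(0)
C = covariance_descriptor(rng.random((32, 32)))
print(C.shape)  # (9, 9)
```

Whatever the patch size, the descriptor stays 9 x 9, which is the dimensionality advantage discussed in this section.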

Since its proposed use in human detection [40], 
the covariance descriptor has gained popularity for many 
applications, such as face recognition p6), license 
plate detection [30], and tracking [31], [45]. Some main 
advantages of choosing the covariance descriptor [42] 
to model the template include its lower dimensionality 
of $(d^2 + d)/2$ (45 in this paper as d = 9), compared 
to the number of target pixels (32 x 32 = 1024 in this 
paper), its ability to fuse multiple possibly correlated 
features, and its robustness in matching targets in different 
views and poses. 

By definition, a covariance matrix is a positive 
semi-definite matrix, and it lies on a Riemannian 
manifold. We will now briefly explain some basic 
operations on the Riemannian manifold. 

2.2 Riemannian Manifold 
Fig. 3. The geodesic distance is the norm of a vector 
on the tangent space $T_{C_i}M$ at point $C_i$ on the 
manifold $M$. 
TABLE 1 

Operations in Euclidean and Riemannian spaces 

Euclidean space                               Riemannian manifold 

$\overrightarrow{C_i C_j} = C_j - C_i$        $\overrightarrow{C_i C_j} = \log_{C_i}(C_j)$ 

$\mathrm{dist}(C_i, C_j) = \|C_j - C_i\|$     $\mathrm{dist}(C_i, C_j) = \|\overrightarrow{C_i C_j}\|_{C_i}$ 
A Riemannian manifold is a differentiable manifold 
in which each tangent space $T_C M$ has a metric 
function $g$ that defines the dot product between 
any two tangent vectors $y_k, y_l$. Since the covariance descriptor 
is a point on the manifold $M$, the following operations 
can be applied to it. The Riemannian metric: 

    $\langle y_k, y_l \rangle_{C_i} = \mathrm{trace}\left( C_i^{-1/2} y_k C_i^{-1} y_l C_i^{-1/2} \right)$.    (4) 

The exponential map $\exp_{C_i} : T_{C_i}M \to M$ takes a 
tangent vector $y$ at point $C_i$ and maps it to another point $C_j$: 

    $C_j = \exp_{C_i}(y) = C_i^{1/2} \exp\left( C_i^{-1/2} y C_i^{-1/2} \right) C_i^{1/2}$.    (5) 

The inverse of the exponential map is the logarithm 
map, which takes a starting point $C_i$ and a destination 
$C_j$ and maps them to the tangent vector $y$ at point $C_i$: 

    $y = \log_{C_i}(C_j) = C_i^{1/2} \log\left( C_i^{-1/2} C_j C_i^{-1/2} \right) C_i^{1/2}$.    (6) 

Finally, the distance between two covariance matrices 
$C_i$ and $C_j$ is given as: 

    $d(C_i, C_j) = \sqrt{ \sum_{k=1}^{d} \ln^2 \lambda_k(C_i, C_j) }$,    (7) 

where $\lambda_k(C_i, C_j)$ are the generalized eigenvalues of 
$C_i$ and $C_j$, that is, $\lambda_k C_i v_k - C_j v_k = 0$, and $d$ is the 
dimension of the covariance matrices. 

Note that $\exp_{C_i}(\cdot)$ and $\log_{C_i}(\cdot)$ are maps on the 
Riemannian manifold, whereas $\exp(\cdot)$ and $\log(\cdot)$ denote 
the ordinary matrix exponential and logarithm 
operations. Both $\exp_{C_i}(y)$ and the tangent vector $y$ are 
$d \times d$ matrices in this paper. 
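
The maps in Eqns. (5) to (7) can be sketched numerically as follows. This is a minimal illustration, assuming symmetric positive definite inputs. The helper `_sym_fun` and the function names are chosen for this example only, and the matrix functions are evaluated through an eigendecomposition rather than a general-purpose matrix exponential.

```python
import numpy as np
from scipy.linalg import eigh


def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    return (V * fun(w)) @ V.T


def exp_map(Ci, y):
    """Eqn. (5): map the tangent vector y at Ci to a point on the manifold."""
    r = _sym_fun(Ci, np.sqrt)
    r_inv = _sym_fun(Ci, lambda w: 1.0 / np.sqrt(w))
    return r @ _sym_fun(r_inv @ y @ r_inv, np.exp) @ r


def log_map(Ci, Cj):
    """Eqn. (6): tangent vector at Ci that points towards Cj."""
    r = _sym_fun(Ci, np.sqrt)
    r_inv = _sym_fun(Ci, lambda w: 1.0 / np.sqrt(w))
    return r @ _sym_fun(r_inv @ Cj @ r_inv, np.log) @ r


def geodesic_distance(Ci, Cj):
    """Eqn. (7): distance from the generalized eigenvalues of (Ci, Cj)."""
    lam = eigh(Cj, Ci, eigvals_only=True)   # solves Cj v = lam * Ci v
    return np.sqrt(np.sum(np.log(lam) ** 2))


rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 4, 4))
Ci = A @ A.T + 4.0 * np.eye(4)
Cj = B @ B.T + 4.0 * np.eye(4)
y = log_map(Ci, Cj)
print(np.allclose(exp_map(Ci, y), Cj))  # True
```

The round trip through `log_map` and `exp_map` recovers `Cj`, and the norm of `y` under the metric of Eqn. (4) equals `geodesic_distance(Ci, Cj)`.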

2.3 Motivation of Manifold Modeling 

High dimensional image data often lie on a low 
dimensional manifold. For example, the collection of 
rotated handwritten zeros in Figure 4 lies in a space of 
dimension 28 x 28 = 784 under a vectorized representation, 
but has only one rotational parameter. Popular 
dimensionality reduction methods such as ISOMAP [39], 
eigenmap and LLE [35] model data using manifold 
structures. In visual tracking, the target patches in 
the image sequence are implicitly bounded by the 
target's degrees of freedom captured by the images, such 
as rotation, translation, scaling, etc. These implicit 
parameters, modeled using a low dimensional manifold, 
could capture image distance. The simplest and yet 



Fig. 4. A collection of handwritten zeros, each rotated by an 
angle of 45° from the previous one. 




(a) Targets  (b) Backgrounds  (c) SVM results  (d) Euclidean distance  (e) Manifold distance 

Fig. 5. An illustration that the distance between images 
can be better modeled on a manifold, using a sequence 
of a dancing penguin from YouTube. In (d), the Euclidean 
distance between targets is larger than the distance 
between target and background; this is not the case on the 
manifold in (e). Furthermore, in (c), a linear SVM cannot 
separate the targets from the backgrounds in the vectorized 
Euclidean space. 



most popular distance measure between images is 
the Euclidean distance between the vectorized images. 
A simple example of the head of a dancing penguin 
from YouTube is adopted in Figure 5 to illustrate that 
the manifold of covariance descriptors can model the 
image distance better and can separate target patches 
from the background better. Using the Euclidean distance, 
the distance between target patches and the first target 
patch could be larger than the one between the 
background and the first target patch. Furthermore, a test 
with a support vector machine using a linear kernel showed 
that some background patches are classified as 
target patches. On the other hand, the target patches 
and background patches were well separated on the 
manifold, as shown in Figure 5(e). 

3 Bayesian Framework 

In this section, we use a standard Bayesian framework 
p3| to formulate the tracking of both template and kinetics 
as follows: 

    $P(C_t, s_t | z_{1:t}) \propto P(z_t | C_t, s_t) \int P(s_t, C_t | s_{t-1}, C_{t-1}) \, P(C_{t-1}, s_{t-1} | z_{1:t-1}) \, ds_{t-1} \, dC_{t-1}$,    (8) 

where $z_t$ is the measurement, $s_t$ is the vector of kinetic state 
variables, $C_t$ is the covariance descriptor, $P(C_t, s_t | z_{1:t})$ 
is the posterior probability of the target template and pose 
given the measurements, $P(z_t | C_t, s_t)$ is the observational 
model, and $P(s_t, C_t | s_{t-1}, C_{t-1})$ is the dynamical 
model. They are further elaborated in the following 
subsections. 



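$\nu_t \in T_{C_{t-1}}M$ is a random process on the tangent plane 
of manifold $M$. An example of this could be the 
Brownian motion process as described by [17]. In this 
paper, we choose to model the template dynamical 
model in the log-transformed space of the manifold as 
follows: 

    $C_t = \exp\left( \log(C_{t-1}) + w_t \right)$,    (14) 
    where $w_t(i,j) \sim N(0, \sigma_{ij}^2)$, 

    $P(\log(C_t)) \propto \exp\left( - \sum_{i \le j;\ i,j \in [1,d]} \frac{ \left[ \log(C_t) - \log(C_{t-1}) \right]_{ij}^2 }{ 2 \sigma_{ij}^2 } \right)$,    (15) 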

where $w_t$ is simply a random symmetric matrix and 
$N(0, \sigma_{ij}^2)$, $i, j \in [1, d]$ are normal distributions. According 
to [1], the matrix exponential function maps a 
symmetric matrix to a corresponding symmetric positive 
definite matrix, $\exp : Sym(d) \to Sym^{+}(d)$, and it is a one-to-one 
mapping. As such, every generated sample of $C_t$ 
is a positive definite, and hence positive semi-definite (PSD), matrix. This 
frees the model from the inherent constraint of positive eigenvalues 
in a PSD matrix. This distribution may be considered 
as a log-normal distribution of the PSD matrices as 
defined in [36]. 
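
A minimal sketch of the log-space random walk of Eqn. (14) is given below. It assumes independent zero-mean Gaussian noise on the upper triangle of $w_t$, mirrored to make it symmetric; the function names and the value of `sigma` are illustrative assumptions rather than the paper's settings.

```python
import numpy as np


def sym_apply(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    return (V * fun(w)) @ V.T


def sample_symmetric_noise(d, sigma, rng):
    """Symmetric w_t whose upper-triangular entries are N(0, sigma_ij^2)."""
    sigma = np.broadcast_to(sigma, (d, d))
    upper = np.triu(rng.normal(0.0, 1.0, size=(d, d)) * sigma)
    return upper + np.triu(upper, k=1).T


def propagate_template(C_prev, sigma, rng):
    """Eqn. (14): C_t = exp(log(C_{t-1}) + w_t)."""
    w = sample_symmetric_noise(C_prev.shape[0], sigma, rng)
    return sym_apply(sym_apply(C_prev, np.log) + w, np.exp)


rng = np.random.default_rng(0)
A = rng.standard_normal((9, 9))
C_prev = A @ A.T + 9.0 * np.eye(9)          # a synthetic positive definite template
C_t = propagate_template(C_prev, 0.05, rng)
print(np.linalg.eigvalsh(C_t).min() > 0.0)  # True
```

Because the noise is added before the matrix exponential, every sample is positive definite by construction and no projection back onto the manifold is needed.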
3.1 Dynamical Model 

The state space in our paper includes both the target 
kinetic variables $s_t$ and the template covariance descriptor 
$C_t$. The state variables are defined in Eqns. (9) and 
(10), and we would like to estimate them through the 
Bayesian framework in Eqn. (8). These state variables 
are propagated from time $t-1$ to $t$ through a dynamical 
model $P(s_t, C_t | s_{t-1}, C_{t-1})$. 



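    $s_t = [x_t, y_t, \dot{x}_t, \dot{y}_t, h_t, \theta_t]$,    (9) 

    $C_t = \mathrm{cov}\left( x_w,\ y_w,\ I(x_w, y_w),\ I_{x_w},\ I_{y_w},\ \sqrt{I_{x_w}^2 + I_{y_w}^2},\ \arctan\frac{I_{y_w}}{I_{x_w}},\ I_{xx_w},\ I_{yy_w} \right)$,    (10) 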



where $x_t, y_t$ are the spatial coordinates of the target 
center position at time $t$, $\dot{x}_t, \dot{y}_t$ are the velocities, $h_t$ 
is the scaling factor, and $\theta_t$ is the orientation. $x_w, y_w$ 
are the coordinates of a pixel on the standard target 
patch warped from $x_t, y_t$, $I(x_w, y_w)$ is the pixel intensity, 
$\{I_{x_w}, I_{y_w}\}$ are the patch intensity gradients, and 
$\{I_{xx_w}, I_{yy_w}\}$ are the second order gradients. Assuming 
independence between the kinetic variables and the covariance, 
we model the joint dynamics as follows: 



    $P(s_t, C_t | s_{t-1}, C_{t-1}) = P(s_t | s_{t-1}) \, P(C_t | C_{t-1})$,    (11) 
    $s_t = k(s_{t-1}) + u_t$,    (12) 
    $C_t = \exp_{C_{t-1}}(\nu_t)$.    (13) 

$k$ is the kinetic model and we use a near constant 
velocity linear model $k(s_{t-1}) = A s_{t-1}$. $u_t$ is generated 
by interacting Gaussian models with jump 
probabilities of [0.9, 0.1] to model sudden changes 
in target poses. As for the template dynamical model, 



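    $C_t^{-1} C_{t-1} = \exp\left( -\log(C_{t-1}) - w_t \right) \exp\left( \log(C_{t-1}) \right) \approx \exp(-w_t)$.    (16) 

Generalized eigenvalues: 

    $\lambda_k C_t v - C_{t-1} v = 0$, 
    $C_t^{-1} C_{t-1} v = \lambda_k v$, 
    $\exp(-w_t) v \approx \lambda_k v$, 

    $d(C_{t-1}, C_t) = \left[ \sum_{k=1}^{d} \ln^2 \lambda_k\left( \exp(-w_t) \right) \right]^{1/2} = \left[ \sum_{k=1}^{d} \lambda_k^2(w_t) \right]^{1/2}$.    (17) 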



, if 3^ 



In this paper, for d = 9, the eigenvalues of $w_t$, $\lambda_1(w_t) \ge 
\lambda_2(w_t) \ge \cdots \ge \lambda_9(w_t)$, can be bounded according 
to [47], assuming the entries of the noise matrix are 
bounded by $[a, b]$, i.e. $a \le w_t(i,j) \le b$: 




Va^TSOF) \a\ < b 



^ (96 + 
9b 



8062) 



otherwise. 

\a\ < b 
otherwise. 



(18) 



(19) 



In other words, the eigenvalues are roughly within 
an order of magnitude of $\max(\sigma_{i,j})$ for this random 
process. In this way, the template diffusion spread on 
the manifold can be easily managed by choosing an 
appropriate $\max(\sigma_{i,j})$ in $w_t$. 
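
The last equality in Eqn. (17) can be checked numerically. The sketch below draws a small symmetric noise matrix (its scale is an arbitrary assumption), and compares the step length computed from the eigenvalues of $\exp(-w_t)$ with the one computed directly from the eigenvalues of $w_t$, which is also the Frobenius norm of $w_t$. It does not test the step from $C_t^{-1} C_{t-1}$ to $\exp(-w_t)$, which is exact only when $\log(C_{t-1})$ and $w_t$ commute.

```python
import numpy as np
from scipy.linalg import expm


def step_lengths(w):
    """Return the step length of Eqn. (17) computed in three equivalent ways."""
    lam_w = np.linalg.eigvalsh(w)
    lam_exp = np.linalg.eigvalsh(expm(-w))            # eigenvalues exp(-lambda_k(w))
    via_exp = np.sqrt(np.sum(np.log(lam_exp) ** 2))   # middle term of Eqn. (17)
    via_w = np.sqrt(np.sum(lam_w ** 2))               # right-hand term of Eqn. (17)
    frobenius = np.linalg.norm(w, "fro")              # same value for symmetric w
    return via_exp, via_w, frobenius


rng = np.random.default_rng(1)
g = rng.normal(0.0, 0.05, size=(9, 9))
w = (g + g.T) / 2.0                                   # symmetric noise matrix
via_exp, via_w, frobenius = step_lengths(w)
print(np.isclose(via_exp, via_w), np.isclose(via_w, frobenius))  # True True
```

This makes the role of $\max(\sigma_{i,j})$ concrete: scaling the noise entries scales the expected step length on the manifold proportionally.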
Fig. 6. Evolution of the template and generated templates at 
frames #1, 101, 201, 501, 801. Green cross: top 10 most 
similar generated templates. Red: ground truth. Blue 
cross: the background. 



3.3 Overall Framework 

We use a standard particle filter to do sequential 
inference. The particle filter [5], [33], jl^ represents 
the distribution of state variables by a collection of 
samples and their weights. The advantage of using 
a particle filter is that it can deal with non-linear 
systems and multi-modal posteriors. The algorithm of 
the particle filter is as follows: 

1) Initialization. The particle filter is initialized 
with a known realization of the target state variables. 
This includes the target initial state values. 
The covariance of the target $C_0$, i.e. the initial template, 
is extracted for comparison later. The parameters 
of the covariance generative process, i.e. the template 
dynamical model, are also determined. 

2) Propagation. Each particle is propagated according 
to the propagation model in Eqns. (11) and 
(14). Both the kinetic variables and the template are 
generated through these random processes. 

3) Measure the likelihood. At each particle $i$, the 
extracted covariance descriptor $\hat{C}_t(i)$ is compared 
to its corresponding template $C_t(i)$. The 
likelihood of the particle is then estimated as 
given in Eqn. (21). 

4) Posterior estimation. The posterior estimate 
gives the estimate of the current target state, 
given all its previous information and measure- 
ments. This could be maximum a posteriori 
probability estimate or minimum mean square 
error estimate (MMSE). In this paper, we use 
MMSE. 

5) Resampling. To avoid any degeneracies, resam- 
pling is conducted to redistribute the weight of 
particles. 

6) Loop. Repeat the process from step 2 to 5 as time 
progresses. 
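
Steps 2 to 5 above can be sketched as one filter iteration. This is a simplified illustration under several assumptions that are not from the paper: the process noise is a single Gaussian rather than the interacting Gaussian models, resampling is multinomial, and `extract` is a hypothetical callback that returns the covariance descriptor measured at a particle's kinetic state.

```python
import numpy as np
from scipy.linalg import eigh


def sym_apply(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    return (V * fun(w)) @ V.T


def geodesic_distance(Ci, Cj):
    """Riemannian distance of Eqn. (7)."""
    lam = eigh(Cj, Ci, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))


def particle_filter_step(states, templates, weights, extract, A,
                         q_std, sigma_w, sigma_z, rng):
    """One propagate / weight / estimate / resample cycle (steps 2 to 5)."""
    n, d = len(states), templates.shape[1]
    templates = templates.copy()
    # Step 2: propagate kinetic states (linear model) and templates (log-space walk).
    states = states @ A.T + rng.normal(0.0, q_std, size=states.shape)
    for i in range(n):
        g = rng.normal(0.0, sigma_w, size=(d, d))
        w = np.triu(g) + np.triu(g, 1).T
        templates[i] = sym_apply(sym_apply(templates[i], np.log) + w, np.exp)
    # Step 3: Gaussian likelihood of the distance to the measured descriptor.
    dist = np.array([geodesic_distance(templates[i], extract(states[i]))
                     for i in range(n)])
    log_w = np.log(weights) - dist ** 2 / (2.0 * sigma_z ** 2)
    log_w -= log_w.max()                      # guard against underflow
    weights = np.exp(log_w)
    weights /= weights.sum()
    # Step 4: MMSE estimate of the kinetic state.
    estimate = weights @ states
    # Step 5: multinomial resampling, then reset to uniform weights.
    idx = rng.choice(n, size=n, p=weights)
    return states[idx], templates[idx], np.full(n, 1.0 / n), estimate
```

Since each particle carries its own template, resampling keeps the template hypotheses that best explain the measurement, and no separate template update step is needed.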



4 Analysis of Template Generative Process 
3.2 Observation Model 

The observation model $P(z_t | C_t, s_t)$ measures the 
likelihood of a target given the target poses and template 
values. It is modeled as follows: 

    $P(z_t | C_t, s_t) \sim N(0, \sigma^2)$, 
    $z_t = d(C_t, \hat{C}_t)$, 
    $\hat{C}_t = g(s_t, \mathrm{Image})$,    (20) 

    $P(z_t | C_t, s_t) \propto \exp\left( -\frac{d^2(C_t, \hat{C}_t)}{2 \sigma^2} \right)$.    (21) 
Here, $d(C_t, \hat{C}_t)$ is given by Eqn. (7). $g$ is the covariance 
computation operator: it takes the kinetic value $s_t$ 
of each particle at time $t$ and warps the region to a 
standard size (in this paper, 32 x 32) before computing 
the covariance. 
Fig. 7. Visualization of the target in the "soccer" sequence 
on the covariance manifold. Red: target patches. Blue: 
background patches. 
In this section, we show that the covariance descriptor 
is a good representation of the target, and we explain 
the motivation behind performing a random walk as 
given in Section 3.1. Two reasonable criteria for a good 
target representation are as follows: 

• the representation evolves gradually as the target 
undergoes changes in pose, appearance, etc., 

• there is a clear separation of target and background. 

To help visualize the distribution of target covariance 
matrices on the manifold, we use multidimensional 
scaling [21]. The distance 
matrix is constructed using the Riemannian distance 
given in Eqn. (7). The visualization shows the relative 
positions of targets (red) and backgrounds (blue). 
Visualizing the PETS 2003 Soccer sequence and the Dudek 
Face sequence in Figures 7 and 6, we noticed that 
our representations of the targets tended to cluster 
together as they evolved gradually. This evolution is 
smoother and easier to model on the manifold than 
the evolution of the original feature values at 
each pixel. This observation motivated us to model 
the template variations by a random walk on 
the Riemannian manifold. Based on Eqns. (12) and 
(14), Figure 6 illustrates a realization of the random 
walk. This shows that our template dynamical model 
can model the actual target appearance variations. 
Changes in facial expression and face pose cause 
the covariance template (shown as red points) to evolve 
slowly on the manifold, and they are well modeled 
by the generated covariances on the manifold (shown 
as green points). 
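
The visualization procedure described above can be sketched as follows. The sketch assumes classical (Torgerson) multidimensional scaling, which may differ from the variant used in [21]; the function names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh


def geodesic_distance(Ci, Cj):
    """Riemannian distance of Eqn. (7)."""
    lam = eigh(Cj, Ci, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))


def pairwise_distances(covs):
    """Symmetric matrix of Riemannian distances between covariance descriptors."""
    n = len(covs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = geodesic_distance(covs[i], covs[j])
    return D


def classical_mds(D, k=2):
    """Embed points in k dimensions so that distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]           # k largest eigenvalues
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))
```

Given lists of target and background descriptors, `classical_mds(pairwise_distances(targets + backgrounds))` returns 2-D coordinates that can be scatter-plotted in red and blue in the manner of Figure 7.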



5 Experiments and Results 

5.1 Experimental data 

We tested our algorithm on some popular tracking 
datasets: David Ross's sequences, including plush toy 
(toy Sylv), toy dog, david, and car 4 from his 
website, the Dudek Face sequence, vehicle tracking 
sequences from PETS2001, and the soccer sequence from 
PETS2003. The test data information is tabulated in 
Table 2. 

5.2 Performance measure 



As spelled out in [22], a good measure should include 
both overall tracking and goodness of track. This 
paper uses the ratio between on-track length and 
sequence length to capture the performance of overall 
tracking, and on-track accuracy for goodness of track. 
Define the tracking errors as $e_x(t) = \|g_x(t) - x(t)\|$ and $e_y(t) = 
\|g_y(t) - y(t)\|$, where $e_x(t), e_y(t)$ are the 
errors in $x, y$ and $g_x(t), g_y(t)$ are the ground truth in $x, y$ at time $t$ 
respectively. 



'^ontrack 



'^ontrack 



exjt) ^ ey{t) 

Hy{t) 



2 \H,{t) 

'Jontrack 



< 1 



^^^ontrack 



1/2 



Hy{t) 



1/2 



(22) 
(23) 

(24) 



$H_x(t), H_y(t)$ are the ground truth target sizes at time 
$t$. In this work, the ground truth of the target center is 
manually annotated, and the target size is assumed to be 
that of the first frame (this may not be applicable to 
frames with a large change in target size). 

5.3 Results and discussion 

We compared our method with the current 
state-of-the-art algorithm, the incremental PCA (IPCA) 
method by Ross et al. [34]. Our results are shown 
in red and the IPCA in green in Figures 9 to 15. 
In the PLUSH TOY SYLV sequence shown in Figure 
9, the IPCA failed to recover tracking from frame #609, 
when it locked onto the background, which looked 
more similar to the upright SYLV. Fast pose changes 
around frame #609 rendered the IPCA eigenbases 
non-representative, as shown in Figure 2. 

Similarly, in Figure 10, the IPCA failed to follow 
through when the target underwent a fast motion towards 
frame #1351. This shortcoming of the IPCA is 
better reflected in the Soccer sequences of PETS2003. The 
IPCA started to drift off from frame #628, shown in 
Figure 11, when the player moved his legs fast, and 
lost track shortly afterwards. In the same sequence in Figure 12, 
the IPCA found it hard to track the opposite team 
players, who wore dark clothes, after a short occlusion 
at frame #285. 

In the Dudek Face sequence in Figure 13, both methods 
perform well despite the subject's rich facial expressions, 
which have more effect on our covariance descriptor. 
In the more stable vehicle sequence from PETS2001 
in Figure 14, again both methods could track well. 
Figure 15 shows an example of a car sequence in 
which our method did not perform satisfactorily. Our 
method locked onto the background whereas the 
IPCA showed robustness to the illumination changes. 
A possible explanation is that our template dynamics 
were unable to account for the dramatic and 
non-smooth transition of the template when the car went 
into a shadowed region. Also, a closer look showed 
that the IPCA eigenbasis looked similar to the target 
template in shadows. 

The overall tracking performance on the test cases 
is summarized in Figure 8. Note that in the image 
sequences of Sylv, PETS2001 and soccer player 4 
the targets move out of the image, which explains the 
short track durations. Nevertheless, our 
method shown in red generally had a longer track 
length. On the other hand, given frames that were on 
TABLE 2 
Test Sequences 

Test sequences            Source        No. of frames   Characteristics 
Plush Toy (Toy Sylv)      David Ross    1344            Fast changing, 3D rotation, scaling, clutter, large movement 
Toy dog                   David Ross    1390            Fast changing, 3D rotation, scaling, clutter, large movement 
Soccer players 1, 2, 3    PETS 2003     1000            Fast changing, white team, good contrast with background, occlusion 
Soccer player 4           PETS 2003     1000            Fast changing, gray (red) team, poor contrast with background, occlusion 
Dudek Face Sequence       A.D. Jepson   1145            Relatively stable, occlusion, 3D rotation 
Truck                     PETS 2001     200             Relatively stable, scaling 
David                     David Ross    503             Relatively stable, 2D rotation 
Car 4                     David Ross    640             Relatively stable, scaling, shadow, specular effects 

(a) Track duration rate $r_{ontrack}$    (b) Track accuracy $rms_{ontrack}$ 

Fig. 8. The results statistics, our results in blue, IPCA in red. 



track for both trackers, IPCA showed better track 
accuracy, as shown in Figure 8(b). For the sequences with 
frequent changes in target appearance, such as the soccer 
sequences, the track goodness was comparable. 
The video sequences may be found on the website 
http://www.youtube.com/watch?v=KaSrVbGyvq4 

Discussion. In stable tracking cases, good pixel-wise 
alignment enabled the IPCA to track very well. The 
IPCA was generally very robust to blurring and even to 
illumination changes, as the eigenbasis tended to encompass 
these changes. In other words, some eigenbases looked 
similar to blurred or illumination-changed templates. 
The distance measure in the IPCA uses a norm of all 
corresponding pixel differences; as such, it tends to be 
very stable and well aligned in the stable target cases. 
On the other hand, it is likely to favor the relatively 
stable regions in the target. When such regions are too 
similar to the background and the target pose changes 
at the same time, the IPCA may lose track very 
quickly, as in the Soccer sequence in Figure 12. On the 
other hand, our method uses the covariance of gradients 
and intensity; the template feature descriptor is much 
smaller in dimension. This may make our method 
slightly less precise than the IPCA, as shown in Figure 
10, in which our method did not match to pixel accuracy. 
In Figure 15, our method lost track when the vehicle 
entered the shadowed region, because both gradients 
and intensity changed significantly and over an 
interval. 

Although our method was slightly less precise
in the stable cases, it gained much more flexibility in
the non-stable tracking scenarios. In the cases of non-rigid
or fast motion of targets, misalignment in the
posterior estimate (the new template sample to add
to the eigenspace in the IPCA) and the eigenbases may
accumulate over a short interval and consequently
render the eigenbases non-representative. This inevitably
leads to loss of track. Our method could
deal with these scenarios much better for two reasons.
Firstly, the template descriptor does not require
pixel-wise alignment and is robust to misalignment.
Secondly, the generative process can accommodate
multiple hypotheses of the template on the covariance
Riemannian manifold, and it automatically selects the
better hypothesis as the target template evolves, as
shown in Figure 6.
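The multiple-hypothesis template propagation discussed above can be
illustrated with a minimal sketch. It assumes an isotropic Gaussian
random walk on the matrix logarithm of the covariance template; the
function names, the noise scale sigma and the particle count are
hypothetical choices made for illustration, and the weighting and
resampling steps of the particle filter are not shown.

```python
import numpy as np

def sym_logm(C):
    """Matrix logarithm of a symmetric positive definite (SPD) matrix."""
    w, V = np.linalg.eigh(C)
    return (V * np.log(w)) @ V.T

def sym_expm(S):
    """Matrix exponential of a symmetric matrix; the result is SPD."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def propagate_templates(C, n_particles, sigma, rng):
    """Draw template hypotheses by a random walk in log-transformed space.

    The noise model (i.i.d. Gaussian entries, symmetrized) is an
    assumption, not necessarily the noise model used in the paper.
    """
    L = sym_logm(C)              # unconstrained symmetric matrix
    d = C.shape[0]
    hypotheses = []
    for _ in range(n_particles):
        G = rng.normal(scale=sigma, size=(d, d))
        noise = (G + G.T) / 2.0  # keep the perturbation symmetric
        hypotheses.append(sym_expm(L + noise))  # map back to the manifold
    return hypotheses

# Usage: perturb an arbitrary SPD matrix standing in for a template.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
C0 = A @ A.T + 5.0 * np.eye(5)
hyps = propagate_templates(C0, n_particles=100, sigma=0.05, rng=rng)
```

Because the perturbation is applied to a symmetric matrix and mapped
back through the exponential, every hypothesis is positive definite by
construction, so no explicit constraint handling is needed.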

Fig. 9. Tracking results on PLUSH TOY SYLV
sequences, frames #133, 594, 609, 613, 957, 1338.
Green: IPCA, Red: our results. The IPCA failed to
recover track from frame #609.

However, there are some limitations in our algorithm.
One of them is the need to carefully choose a
suitable region for tracking. Since we used published
features such as intensity, gradients, and
second-order gradients for the covariance, these features
are sensitive to specular effects and dark shadows, as
shown in Figure 15. It is also important to choose
a target region with fairly good gradient variations;
otherwise the covariance descriptor may be ill-conditioned,
consequently affecting both the eigenvalue
estimation and the distance measurements in Equation
0.
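The conditioning concern can be checked before a region is accepted
for tracking. The sketch below is an illustration under stated
assumptions: the feature set (intensity plus absolute first- and
second-order gradients), the condition-number threshold, and the
generalized-eigenvalue distance commonly used with covariance
descriptors are assumptions, and they may differ from the exact
features and distance equation used in this paper. All function names
are hypothetical.

```python
import numpy as np

def covariance_descriptor(patch):
    """Covariance of per-pixel features [I, |Ix|, |Iy|, |Ixx|, |Iyy|]."""
    I = patch.astype(float)
    Iy, Ix = np.gradient(I)        # first-order gradients (rows, cols)
    Iyy = np.gradient(Iy, axis=0)  # second-order gradients
    Ixx = np.gradient(Ix, axis=1)
    F = np.stack([I, np.abs(Ix), np.abs(Iy), np.abs(Ixx), np.abs(Iyy)],
                 axis=-1)
    F = F.reshape(-1, F.shape[-1])  # one feature vector per pixel
    return np.cov(F, rowvar=False)

def is_well_conditioned(C, max_cond=1e8):
    """True if C is positive definite with a bounded condition number."""
    w = np.linalg.eigvalsh(C)       # eigenvalues in ascending order
    return bool(w[0] > 0 and w[-1] / w[0] < max_cond)

def covariance_distance(C1, C2):
    """sqrt(sum_i ln^2 lambda_i), lambda_i generalized eigenvalues of (C1, C2)."""
    w, V = np.linalg.eigh(C1)
    W = (V / np.sqrt(w)) @ V.T      # C1^(-1/2)
    lam = np.linalg.eigvalsh(W @ C2 @ W)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

# Usage: a textured patch is accepted, a flat patch is rejected.
rng = np.random.default_rng(0)
textured = rng.random((32, 32))
flat = np.full((32, 32), 0.5)
C_tex = covariance_descriptor(textured)
C_flat = covariance_descriptor(flat)
C_other = covariance_descriptor(rng.random((32, 32)))
```

A flat region yields a covariance matrix with zero eigenvalues, for
which the logarithm of the generalized eigenvalues is undefined; this
is why the check is applied before any distance is computed.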



6 Conclusion 

In this paper, we have proposed a new method to
update the target model in tandem with the target
kinematics. More precisely, we have developed a
generative template model in a principled way within a
Bayesian framework. We have introduced a novel template
propagation mechanism in the log-transformed space of the
covariance manifold to free the constraints inherently
imposed by positive definite matrices. We have shown
that this simple generative process allows the template
to evolve naturally with target appearance variation.
It is hoped that by jointly quantifying the uncertainties
of the target kinematics and template, we are able to
achieve more robust visual tracking. We have chosen
the covariance descriptor as the target representation
and modeled the template dynamics
using a random walk on the covariance Riemannian
manifold. Our template dynamic model is an example
of a diffusion process on the covariance Riemannian
manifold. In the experiments, our algorithm outperformed
the current state-of-the-art algorithm,
IPCA, particularly when the target underwent fast
and non-rigid pose changes, and maintained
comparable performance when the target was more
stable. Some future work includes automatic selection
of covariance features that are more robust to sudden,
dramatic changes in illumination.

Future work also includes addressing a number of questions,
such as how the diffusion speed should be
adjusted and whether the diffusion process can be better
constrained. Another area of work is dealing with
illumination changes in the manifold generative process.
To improve the goodness of track, a more
discriminative target descriptor is to be explored.




Fig. 10. Tracking results on toy dog sequences, frames
#1, 450, 715, 1014, 1271, 1351. Green: IPCA, Red:
our results. The IPCA was slightly better localized in
the stable case, but failed to follow through when the target
underwent a fast motion towards frame #1351.




Fig. 11. Tracking results on soccer sequences, frames
#246, 628, 630, 661, 686, 996. Green: IPCA, Red:
our results. The IPCA started to drift off from frame
#628 when the player's legs moved fast, and lost track
shortly afterwards.




Fig. 12. Tracking results on soccer sequences, frames
#10, 15, 122, 248, 285, 360. Green: IPCA, Red: our
results. The IPCA started to drift off from frame #15 due
to low contrast between the target and the background.
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